Zero-Shot Learning

Zero-shot learning focuses on the relation between visual features X, semantic embeddings A, and category labels Y. Based on the approach, existing zero-shot learning works can be roughly categorized into the following groups:

1) semantic relatedness: X->Y (measure semantic similarity to seen classes; directly write a classifier for each unseen class)

2) semantic embedding: X->A->Y (map from X to A; map from A to X; or map both A and X into a common space)
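
A minimal sketch of the embedding family 2), assuming ridge regression from visual features to the attribute space followed by nearest-prototype search over unseen-class attributes; the dimensions, data, and regularizer below are illustrative assumptions, not a specific published method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 2048-d visual features, 85-d attributes,
# 40 seen and 10 unseen classes (AwA-like, purely for the sketch).
d_x, d_a, n_seen, n_unseen = 2048, 85, 40, 10
A_seen = rng.normal(size=(n_seen, d_a))      # seen-class attribute vectors
A_unseen = rng.normal(size=(n_unseen, d_a))  # unseen-class attribute vectors

# Labeled training data come from seen classes only (placeholder data here).
n_train = 2000
y_train = rng.integers(0, n_seen, size=n_train)
X_train = rng.normal(size=(n_train, d_x))

# 1) Learn a linear map W: X -> A by ridge regression, using each image's
#    class attribute vector as the regression target.
lam = 1.0
T = A_seen[y_train]                                          # (n_train, d_a)
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(d_x),
                    X_train.T @ T)                           # (d_x, d_a)

# 2) Standard ZSL inference: project a test feature into attribute space and
#    pick the nearest unseen-class prototype by cosine similarity.
def predict_unseen(X_test):
    P = X_test @ W
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    U = A_unseen / np.linalg.norm(A_unseen, axis=1, keepdims=True)
    return np.argmax(P @ U.T, axis=1)        # index into the unseen classes

print(predict_unseen(rng.normal(size=(5, d_x))))
```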

Based on the setting, existing zero-shot learning works can be roughly categorized into the following groups:

1) inductive ZSL (unlabeled test images are not used in the training stage) vs. semi-supervised/transductive ZSL (unlabeled test images are used in the training stage)

2) standard ZSL (test images come only from unseen categories) vs. generalized ZSL (test images come from both seen and unseen categories; common remedies for the seen-class bias are novelty detection and calibrated stacking)
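
The two settings in 2) differ only in the label space searched at test time. A minimal sketch, assuming a precomputed score matrix over all classes (the scores and class counts are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
n_test, n_seen, n_unseen = 6, 40, 10

# Compatibility scores of each test image against every class (e.g., similarities
# in the embedding space); seen classes fill the first n_seen columns.
scores = rng.normal(size=(n_test, n_seen + n_unseen))

# Standard ZSL: test images are guaranteed to be from unseen classes,
# so only the unseen label space is searched.
standard_pred = n_seen + np.argmax(scores[:, n_seen:], axis=1)

# Generalized ZSL: test images may come from any class, so all columns compete.
gzsl_pred = np.argmax(scores, axis=1)

print(standard_pred)
print(gzsl_pred)
```

Calibrated stacking modifies the generalized case by penalizing the seen-class columns before the argmax; a sketch appears under Critical Issues below.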

Ideas:

  1. Mapping: dictionary learning, metric learning, etc.

  2. Embedding: multiple embedding [1], free embedding [1], self-defined embedding [1]

  3. Application: video->object(attribute)->action [1], image->object(attribute)->scene

  4. Combination: with active learning [1] [2], online learning [1]

  5. External knowledge graph: WordNet-based [1], NELL-based [2]

  6. Deep learning: graph neural network [1], RNN [2]

  7. Generate synthetic exemplars for unseen categories: synthetic images [SP-AEN] or synthetic features [SE-ZSL] [GAZSL] [f-xGAN]
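
A deliberately simplified sketch of idea 7: instead of the GAN-based generators in the cited works (SE-ZSL, GAZSL, f-xGAN), a ridge regressor predicts a per-class feature exemplar from the attributes, synthetic unseen-class features are sampled around it, and an ordinary classifier is then trained over the joint label space. Everything here (data, dimensions, noise model) is an assumption for illustration, not a reimplementation of those papers.

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(2)
d_x, d_a, n_seen, n_unseen = 512, 85, 40, 10
A_seen = rng.normal(size=(n_seen, d_a))
A_unseen = rng.normal(size=(n_unseen, d_a))

# Real labeled features exist for seen classes only (placeholder data here).
n_train = 2000
y_train = rng.integers(0, n_seen, size=n_train)
X_train = rng.normal(size=(n_train, d_x))

# 1) Learn attributes -> class feature exemplar (here, the per-class mean feature).
exemplars = np.stack([X_train[y_train == c].mean(axis=0) for c in range(n_seen)])
reg = Ridge(alpha=1.0).fit(A_seen, exemplars)

# 2) Synthesize unseen-class features: predicted exemplar + Gaussian noise,
#    a crude stand-in for a learned conditional generator.
n_syn = 100                                   # synthetic samples per unseen class
sigma = X_train.std()
unseen_exemplars = reg.predict(A_unseen)      # (n_unseen, d_x)
X_syn = (np.repeat(unseen_exemplars, n_syn, axis=0)
         + sigma * rng.normal(size=(n_unseen * n_syn, d_x)))
y_syn = np.repeat(np.arange(n_seen, n_seen + n_unseen), n_syn)

# 3) Train a plain classifier over seen + unseen labels for generalized ZSL.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_train, X_syn]), np.concatenate([y_train, y_syn]))
print(clf.predict(rng.normal(size=(3, d_x))))
```

In the cited works the Gaussian sampler above is replaced by a learned conditional generator (GAN- or VAE-based) trained on seen-class features.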

Critical Issues:

  1. generalized ZSL, why first predict seen or unseen?: As argued in [1], only labeled data from seen classes are available during training, so the scoring functions of the seen classes tend to dominate those of the unseen classes. Because the seen-class classifiers are never trained on negative examples from the unseen classes, GZSL predictions are biased: new data points are aggressively assigned to the seen label space S. Deciding first whether a test sample is seen or unseen, or recalibrating the seen-class scores, counteracts this bias (see the calibrated-stacking sketch after this list).

  2. hubness problem [1][2]: As noted in [2], one practical effect of the ZSL domain shift is the hubness problem: after the shift, a small set of "hub" test-class prototypes become the nearest or k nearest neighbours of the majority of test samples in the semantic space, while other prototypes are the nearest neighbours of no test instance. The result is poor accuracy and highly biased predictions, with most test examples assigned to a small minority of classes (see the hubness-measurement sketch after this list).

  3. projection domain shift: what is the impact on the decision values?
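
For issue 1, calibrated stacking (named under setting 2 above) reduces the seen-class bias by subtracting a calibration constant from the seen-class scores before taking the argmax. A minimal sketch, with the score matrix, the injected bias, and the constants all assumed for illustration; in practice the constant is tuned on validation data to balance seen and unseen accuracy.

```python
import numpy as np

rng = np.random.default_rng(3)
n_test, n_seen, n_unseen = 1000, 40, 10

# Toy GZSL score matrix in which seen-class scores dominate on average,
# mimicking the training bias described in issue 1.
scores = rng.normal(size=(n_test, n_seen + n_unseen))
scores[:, :n_seen] += 0.5

def calibrated_stacking(scores, n_seen, gamma):
    """Subtract a calibration constant gamma from every seen-class score."""
    adjusted = scores.copy()
    adjusted[:, :n_seen] -= gamma
    return np.argmax(adjusted, axis=1)

for gamma in (0.0, 0.5, 1.0):
    pred = calibrated_stacking(scores, n_seen, gamma)
    frac_unseen = np.mean(pred >= n_seen)
    print(f"gamma={gamma}: fraction predicted as unseen = {frac_unseen:.2f}")
```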
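
For issue 2, hubness is commonly quantified by the skewness of the k-occurrence distribution N_k, i.e. how often each class prototype appears among the k nearest neighbours of the projected test samples. A minimal measurement sketch with randomly generated placeholder data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(4)
d_a, n_unseen, n_test, k = 85, 10, 500, 1

# Unseen-class prototypes and projected test samples in the semantic space;
# the projections are shrunk towards the origin as a toy stand-in for domain shift.
prototypes = rng.normal(size=(n_unseen, d_a))
projections = rng.normal(size=(n_test, d_a)) * 0.5

# k-occurrence N_k: how many test samples have each prototype among their k NNs.
dists = np.linalg.norm(projections[:, None, :] - prototypes[None, :, :], axis=2)
knn = np.argsort(dists, axis=1)[:, :k]                 # (n_test, k) prototype indices
N_k = np.bincount(knn.ravel(), minlength=n_unseen)

# Strong positive skewness of N_k indicates hubs: a few prototypes attract
# most test samples, the biased-prediction symptom described in issue 2.
print("N_k per prototype:", N_k)
print("skewness:", skew(N_k))
```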

Datasets:

  1. small-scale datasets: CUB, AwA, SUN, aPY, Dogs, FLO

  2. large-scale dataset: ImageNet

Survey and Resource:

  1. Recent Advances in Zero-Shot Recognition

  2. Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly [code]

  3. List of papers and datasets

Other applications:

  1. zero-shot object detection
  2. zero-shot figure-ground segmentation [1]
  3. zero-shot semantic segmentation
  4. zero-shot retrieval
  5. zero-shot domain adaptation